[v1][WIP] Metrics & Stats prototype #10651
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:

🚀
Force-pushed from 626d716 to 0a666e7
    # TODO: can we eliminate these?
    async def _run_stats_handler(self):
        while True:
This is where stats were polled from the backend.
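For illustration, a minimal sketch of what such a frontend polling loop could look like; `log_interval_s`, `engine_core.get_stats`, and `stat_loggers` are assumed names for this sketch (only `stat_logger.log(stats)` appears elsewhere in this PR):

```python
import asyncio

async def _run_stats_handler(self):
    # Periodically pull aggregated stats from the EngineCore process
    # and hand them to the registered stat loggers.
    while True:
        await asyncio.sleep(self.log_interval_s)  # e.g. 1 second
        stats = await self.engine_core.get_stats()  # assumed RPC to the backend
        for stat_logger in self.stat_loggers:
            stat_logger.log(stats)
```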
    request_type = type_frame.buffer
    request_data = data_frame.buffer

    if request_type == EngineCoreRequestType.STATS.value:
This is where the backend process responds. Note this is independent from the actual execution thread.
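Roughly, the backend-side dispatch could look like the sketch below; the `snapshot` method name and the response socket/framing are assumptions for illustration, not this PR's actual code:

```python
import pickle

def _handle_request(self, request_type: bytes, request_data: bytes) -> None:
    # Runs on the EngineCore IO thread, not the execution thread.
    if request_type == EngineCoreRequestType.STATS.value:
        # Snapshot the agent's accumulated stats and ship them back
        # without blocking the model execution loop.
        stats = self.stats_agent.snapshot()  # assumed method name
        self.stats_socket.send(pickle.dumps(stats))  # assumed response socket
```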
This pull request has merge conflicts that must be resolved before it can be merged.
        free_encoder_input_ids=self.encoder_cache_manager.get_freed_ids(),
    )

    if self.log_stats:
One thing that got us in trouble in V0 is too many loops over these types of data structures. I wonder if the creation of the request stats could be done in the list comprehensions used above, rather than in a separate pass (see the sketch below).
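For example, the outputs and the request stats could be built in a single pass instead of two comprehensions; `make_output`, `make_request_stats`, and `scheduled_reqs` are placeholder names here:

```python
outputs = []
request_stats = []
for req in scheduled_reqs:
    outputs.append(make_output(req))
    if self.log_stats:
        # Built in the same pass rather than a second loop over the requests.
        request_stats.append(make_request_stats(req))
```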
This pull request has merge conflicts that must be resolved before it can be merged.
    detokenizer_req, engine_core_req = self.processor.process_inputs(
        request_id, prompt, params, arrival_time, lora_request,
        trace_headers, prompt_adapter_request, priority)
    self.stats_manager.record_engine_input(engine_core_req)
do we need this for LLMEngine as well?
Oh yes, the prototype was only on AsyncLLM, but it should be the same for the sync LLMEngine (a sketch follows below).
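A minimal sketch of the same hook on the sync engine, assuming its `add_request` mirrors the AsyncLLM path shown above (the downstream calls are assumptions):

```python
def add_request(self, request_id, prompt, params, arrival_time=None,
                lora_request=None, trace_headers=None,
                prompt_adapter_request=None, priority=0):
    # Same input-processing path as AsyncLLM above.
    detokenizer_req, engine_core_req = self.processor.process_inputs(
        request_id, prompt, params, arrival_time, lora_request,
        trace_headers, prompt_adapter_request, priority)
    # Mirror the AsyncLLM hook so sync-engine requests are counted too.
    self.stats_manager.record_engine_input(engine_core_req)
    self.detokenizer.add_request(detokenizer_req)   # assumed downstream calls
    self.engine_core.add_request(engine_core_req)
```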
        stat_logger.log(stats)


    class ThreadSafeEngineStatsManager(EngineStatsManager):
There aren't any threads in the AsyncLLM class right now (it's just using asyncio), so the ThreadSafe variant is not needed.
Oh right, good catch. I wonder if in the future we should isolate the stats logger in another thread, since the stats logger could be pretty expensive (e.g. the Prometheus logging is quite slow when it comes to histograms). Something like the sketch below.
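A sketch of that isolation: a daemon thread drains a queue, so the event loop only pays for an enqueue. The class and attribute names here are made up for illustration:

```python
import queue
import threading

class BackgroundStatLogger:
    """Wraps expensive loggers (e.g. Prometheus histograms) in a thread."""

    def __init__(self, stat_loggers):
        self._stat_loggers = stat_loggers
        self._queue: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, stats):
        # Cheap for the caller: the slow logging happens off the event loop.
        self._queue.put(stats)

    def _drain(self):
        while True:
            stats = self._queue.get()
            for stat_logger in self._stat_loggers:
                stat_logger.log(stats)
```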
    # on the background thread.
    # 2. the main event loop when a request is added (and input
    #    processed).
    self.stats_manager = ThreadSafeEngineStatsManager(
There aren't any threads in this class (just asyncio), so we don't need a thread-safe variant.
I think this design is on the right track. As discussed in Slack (https://vllm-dev.slack.com/archives/C07QTE01QFQ/p1732588846571919), I think we should remove the EngineStatsAgent from the EngineCore and instead send the stats back in EngineCoreOutputs each step.
This will simplify the state management in the EngineCore and reduce the number of RPCs we have to worry about, with the same amount of work being executed on the core busy loop as in the current design. A sketch of that direction follows.
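A rough sketch of the suggested direction, piggybacking stats on the per-step outputs; every name here beyond `EngineCoreOutputs` is an assumption for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SchedulerStats:
    # Assumed per-step counters; whatever the scheduler can cheaply report.
    num_running_reqs: int = 0
    num_waiting_reqs: int = 0
    gpu_cache_usage: float = 0.0

@dataclass
class EngineCoreOutput:
    request_id: str = ""
    new_token_ids: List[int] = field(default_factory=list)

@dataclass
class EngineCoreOutputs:
    outputs: List[EngineCoreOutput] = field(default_factory=list)
    # Stats ride along with every step, so no separate STATS RPC
    # and no EngineStatsAgent state inside the EngineCore.
    scheduler_stats: Optional[SchedulerStats] = None
```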
Closing as stale, feel free to re-open if you feel this PR is still relevant.
Requirements and Goals
Proposal and Design
- `RequestStats` and `RequestStatsUpdate` capture request updates for metrics. They cover the entire request lifecycle (arrival -> input processing -> engine core prefill/decode -> [preempt] -> detokenized); see the sketch after this list.
- `EngineStatsAgent` lives with the `EngineCore` to collect stats updates from various core components and scheduler/model outputs.
- `EngineStatsManager` lives on the `AsyncLLM` engine and aggregates stats from the agent.

On Backend (EngineCore Process):

- `EngineStatsAgent` collects metrics in a thread-safe manner.

On Frontend (Main Process):

- `EngineStatsManager` aggregates stats from the backend as well as collecting stats on the engine frontend.
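A sketch of the lifecycle updates described above; only the two class names come from this PR, while the enum values and fields are illustrative:

```python
import enum
import time
from dataclasses import dataclass, field
from typing import List

class RequestUpdateType(enum.Enum):
    # Assumed lifecycle stages, following the arrow chain above.
    ARRIVED = enum.auto()
    INPUT_PROCESSED = enum.auto()
    PREFILLING = enum.auto()
    DECODING = enum.auto()
    PREEMPTED = enum.auto()
    DETOKENIZED = enum.auto()

@dataclass
class RequestStatsUpdate:
    request_id: str
    update_type: RequestUpdateType
    ts: float = field(default_factory=time.monotonic)

@dataclass
class RequestStats:
    request_id: str
    updates: List[RequestStatsUpdate] = field(default_factory=list)

    def record(self, update_type: RequestUpdateType) -> None:
        # Append a timestamped update as the request moves through stages.
        self.updates.append(RequestStatsUpdate(self.request_id, update_type))
```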
Benchmark Results

TL;DR: There's no observable performance regression when stats collection happens every 1 second for the workloads below, compared to stats collection turned off.
Throughput
Latency